14.1 Undercomplete Autoencoders

An autoencoder whose code dimension is less than the input dimension is called undercomplete.

The learning process consists of minimizing a loss function

\[L(x, g(f(x)))\]

where L is a loss function, such as the mean squared error, that penalizes g(f(x)) for being dissimilar from x.

When the decoder is linear and L is the mean squared error, an undercomplete autoencoder learns to span the same subspace as PCA. In this case, an autoencoder trained to perform the copying task has learned the principal subspace of the training data as a side effect.

In PCA, we project the higher-dimensional data \(x \in R^d\) onto a lower-dimensional space with a matrix \(U \in R^{d \times p}\), where \(d > p\):

\[y = U^Tx\]

To (approximately) reconstruct the original data you can apply:

\[\hat{x} = Uy\]

You can imagine a simple autoencoder that applies \(U^T\) as the encoder and \(U\) as the decoder. That is the same logic as PCA.
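
As a concrete sketch, here is a minimal NumPy example (the data and dimensions are made up for illustration) that computes \(U\) from the top principal components and uses \(U^T\) as the encoder and \(U\) as the decoder:

```python
import numpy as np

# Toy data: 200 samples in d = 5 dimensions (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)              # PCA assumes centered data

# Principal directions via SVD; keep the top p = 2 components
_, _, Vt = np.linalg.svd(X, full_matrices=False)
p = 2
U = Vt[:p].T                        # shape (d, p); columns are principal directions

Y = X @ U                           # encoder:  y = U^T x       (codes)
X_hat = Y @ U.T                     # decoder:  x_hat = U y     (reconstructions)

print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
```

A linear autoencoder trained with mean squared error converges to a decoder whose columns span this same principal subspace, though not necessarily the orthonormal basis \(U\) itself.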

Review of information theory

Low probability: more information. High probability: less information.

Information gain:

\[I = -\log P(x)\]

Entropy (expected information gain):

\[H = - \sum P(x)\log P(x)\]
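
A quick numerical illustration of these two definitions (the distribution and the choice of base-2 logarithm are made up for illustration):

```python
import numpy as np

# A made-up discrete distribution over four outcomes
P = np.array([0.5, 0.25, 0.125, 0.125])

# Information gain (surprisal) of each outcome, in bits:
# rare outcomes carry more information than common ones
I = -np.log2(P)
print("information per outcome:", I)    # [1. 2. 3. 3.]

# Entropy: the expected information gain
H = np.sum(P * I)
print("entropy:", H)                    # 1.75 bits
```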

KL divergence, which measures how different two distributions are:

\[\begin{split}\begin{aligned} KL(P \| Q) &= - \sum P(x)\log Q(x) + \sum P(x)\log P(x) \\ &= \sum P(x) \log \frac{P(x)}{Q(x)} \end{aligned}\end{split}\]

KL divergence is not a true distance metric. In the expression above, the expectation is taken under P, so \(KL(P \| Q)\) compares Q against the reference distribution P. There are two important properties of KL divergence:

  1. \(KL(P \| Q) \geq 0\)
  2. \(KL(P \| Q) \neq KL(Q \| P)\) (not symmetric; see the numerical check below)
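
Both properties are easy to verify numerically. A minimal NumPy check (the two distributions are made up):

```python
import numpy as np

def kl(P, Q):
    """Discrete KL divergence: KL(P || Q) = sum_x P(x) log(P(x) / Q(x))."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    return np.sum(P * np.log(P / Q))

# Two made-up discrete distributions over three outcomes
P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.4, 0.4, 0.2])

print(kl(P, Q))   # non-negative
print(kl(Q, P))   # a different value: KL is not symmetric
```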

Variational inference

Suppose x is an observation and z is a hidden (latent) variable. In order to infer z from x, we have to compute the posterior:

\[\begin{equation} p(z|x) = \frac{p(x|z)p(z)}{p(x)} \end{equation}\]

Computing \(p(x) = \int p(x|z)p(z)dz\) is very challenging (imagine z with very high dimension); it is often intractable. There are two main ways to deal with this challenge: sampling-based methods such as MCMC, and variational inference. Here we focus on variational inference.
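
To make the marginal concrete, here is a naive Monte Carlo sketch of \(p(x) = \int p(x|z)p(z)dz\) for a toy one-dimensional model (the densities below are made up for illustration). In one dimension this works well; when z is high-dimensional, almost all samples land where \(p(x|z)\) is negligible, which is why the integral is usually treated as intractable.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Toy generative model (made up): z ~ N(0, 1),  x | z ~ N(z, 0.5^2)
x_observed = 1.3

# Naive Monte Carlo estimate of p(x) = E_{p(z)}[ p(x | z) ]
z_samples = rng.normal(0.0, 1.0, size=100_000)
p_x = np.mean(gaussian_pdf(x_observed, mean=z_samples, std=0.5))
print("estimated p(x):", p_x)
```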

We can approximate \(p(z|x)\) with \(q(z)\). If we choose q to be a tractable distribution, such as a Gaussian or a Bernoulli, then we have an approximate distribution that we can compute. So the goal is to minimize:

\[\begin{split}\begin{aligned} KL(q(z) \| p(z|x)) &= - \sum q(z) \log \frac{p(z|x)}{q(z)} \\ &= - \sum q(z) \log \frac{\frac{p(x, z)}{p(x)}}{q(z)} \\ &= - \sum q(z) \left( \log \frac{p(x, z)}{q(z)} - \log p(x) \right) \\ &= - \sum q(z) \log \frac{p(x, z)}{q(z)} + \sum q(z)\log p(x) \\ &= - \sum q(z) \log \frac{p(x, z)}{q(z)} + \log p(x) \sum q(z) \\ &= - \sum q(z) \log \frac{p(x, z)}{q(z)} + \log p(x) \end{aligned}\end{split}\]

So we can rewrite the equation as

\[\begin{equation} \log p(x) = KL(q(z) \| p(z|x)) + \sum q(z) \log \frac{p(x, z)}{q(z)} \end{equation}\]

Since \(\log p(x)\) is fixed (it does not depend on q), minimizing \(KL(q(z) \| p(z|x))\) is equivalent to maximizing \(\sum q(z) \log \frac{p(x, z)}{q(z)}\), which we call L. This term is also known as the variational lower bound (or evidence lower bound, ELBO). Since the KL divergence is always non-negative, \(\log p(x)\) is always greater than or equal to L.
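
For a small discrete model, the identity \(\log p(x) = KL(q(z) \| p(z|x)) + L\) can be checked directly (all numbers below are made up):

```python
import numpy as np

# Toy discrete model: z takes three values, x is one fixed observation (made up)
p_z = np.array([0.5, 0.3, 0.2])          # prior p(z)
p_x_given_z = np.array([0.9, 0.4, 0.1])  # likelihood p(x | z) for the observed x

p_xz = p_x_given_z * p_z                 # joint p(x, z)
p_x = p_xz.sum()                         # evidence p(x)
p_z_given_x = p_xz / p_x                 # exact posterior p(z | x)

q = np.array([0.6, 0.3, 0.1])            # an arbitrary approximate posterior q(z)

kl_term = np.sum(q * np.log(q / p_z_given_x))   # KL(q(z) || p(z|x))
L = np.sum(q * np.log(p_xz / q))                # variational lower bound

print(np.log(p_x))        # log p(x)
print(kl_term + L)        # identical: log p(x) = KL + L
print(L <= np.log(p_x))   # True: L is a lower bound
```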

We can expand L further:

\[\begin{split}\begin{aligned} \sum q(z) \log \frac{p(x, z)}{q(z)} &= \sum q(z)\log p(x|z) + \sum q(z)\log\frac{p(z)}{q(z)} \\ &= \sum q(z)\log p(x|z) - KL(q(z) \| p(z)) \\ &= E_{q(z)}[\log p(x|z)] - KL(q(z) \| p(z)) \end{aligned}\end{split}\]

To maximize L we need to make \(E_{q(z)}[\log p(x|z)]\) as large as possible. We can think of \(E_{q(z)}[\log p(x|z)]\) as a reconstruction term. If \(p(x|z)\) is Gaussian, maximizing \(E_{q(z)}[\log p(x|z)]\) is the same as minimizing \(\|x - \hat{x}\|^2\) (mean squared error); if it is Bernoulli, it is the same as minimizing the cross-entropy.
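
To see the Gaussian case concretely, assume \(p(x|z) = N(x; \hat{x}, I)\), where \(\hat{x}\) is the decoder output and the unit variance is an assumption made here for illustration (\(d\) is the dimension of x). Then

\[\log p(x|z) = -\frac{1}{2}\|x - \hat{x}\|^2 - \frac{d}{2}\log(2\pi)\]

and the second term is a constant, so maximizing \(\log p(x|z)\) is exactly minimizing \(\|x - \hat{x}\|^2\). Similarly, for a Bernoulli \(p(x|z) = \prod_i \hat{x}_i^{x_i}(1-\hat{x}_i)^{1-x_i}\), the log-likelihood \(\sum_i \left[ x_i \log \hat{x}_i + (1-x_i)\log(1-\hat{x}_i) \right]\) is the negative cross-entropy.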

At the same time, we need to make q(z) as similar to p(z) as possible, i.e., minimize the \(KL(q(z) \| p(z))\) term, which keeps the approximate posterior close to the prior.
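
Putting both terms together, here is a minimal PyTorch-style sketch of the resulting loss (the negative ELBO), assuming a Gaussian \(q(z|x) = N(\mu, \mathrm{diag}(\sigma^2))\), a standard normal prior p(z), and Bernoulli outputs; the function and variable names are placeholders, not a reference implementation:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_hat, mu, logvar):
    """Loss = reconstruction term + KL(q(z|x) || p(z)).

    x          : original input, values in [0, 1]
    x_hat      : decoder output (reconstruction), values in (0, 1)
    mu, logvar : mean and log-variance of the Gaussian q(z|x) from the encoder
    """
    # Bernoulli reconstruction term: cross-entropy between x and x_hat
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")

    # Closed-form KL( N(mu, sigma^2) || N(0, I) ) for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    return recon + kl
```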